Conversation

@anivar anivar commented Aug 16, 2025

This PR fixes three critical issues preventing Server v2 from being used in production:

  1. URL prefix normalization - Now handles --url-prefix //path correctly like the old server
  2. Args file loading - Fixed .args being loaded too late
  3. Connection stability - Removed aggressive client dropping, fixed partial writes, increased buffer size

The changes are minimal (~60 lines) and follow patterns from the existing server implementation.

Fixes #767, #783, #787

@vlasky vlasky (Contributor) commented Dec 4, 2025

@anivar please explain the consequences of the .args being loaded too late. What failure/issue does it cause?

Fixes four critical production issues in llamafile Server v2:

1. **Fix .args loading timing** (llama.cpp main/main.cpp)
   - Move cosmo_args() call before determine_program()
   - Ensures --server --v2 flags in .args are seen when determining program mode
   - Fixes mozilla-ai#783

2. **Add URL prefix normalization** (llamafile/flags.cpp)
   - Consolidate consecutive slashes (//api/v1 → /api/v1)
   - Ensure leading slash, remove trailing slash
   - Validate AFTER normalization
   - Use static std::string for proper lifetime management (no memory leak)
   - Fixes mozilla-ai#767

3. **Robust partial write handling** (llamafile/server/client.cpp)
   - Implement a full write loop to handle partial writes correctly (see the sketch after this list)
   - Handle EINTR (signal interruption) gracefully
   - Properly detect connection closure
   - Increase file transfer buffer from 512B to 16KB for better performance

4. **Remove aggressive client dropping** (llamafile/server/worker.cpp)
   - Remove code that kills oldest active connection when all workers busy
   - Let TCP listen backlog naturally queue incoming connections
   - Provides better UX (graceful queuing vs abrupt disconnection)
   - Fixes mozilla-ai#787

All fixes improve upon the original PR mozilla-ai#788 with better error handling
and no memory leaks.
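
For illustration, here is a minimal sketch of the write loop described in item 3. The helper name is hypothetical and the actual client.cpp code may be structured differently; the point is the retry-until-done pattern:

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Hypothetical helper showing the approach: keep calling write() until every
// byte is sent, retry on EINTR, and treat any other error (or a closed
// connection) as failure instead of assuming one write() sends everything.
static bool write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t rc = write(fd, buf, len);
        if (rc > 0) {
            buf += rc;                  // advance past the bytes actually sent
            len -= (size_t)rc;
        } else if (rc == -1 && errno == EINTR) {
            continue;                   // interrupted by a signal: retry
        } else {
            return false;               // hard error or peer closed the socket
        }
    }
    return true;
}
```
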
@anivar anivar force-pushed the fix-server-v2-production-issues branch from 4746d30 to 78a2261 on December 4, 2025 at 16:58
@anivar anivar (Author) commented Dec 4, 2025

@vlasky Great question! The timing issue causes a real production problem for anyone distributing llamafiles with embedded configuration.

Here's what happens: when a user embeds --server --v2 in their .args file and runs the llamafile without any CLI arguments, determine_program() executes before .args loads. At that point, argv is essentially empty, so the function picks the wrong mode (usually defaulting to chatbot). By the time .args finally loads and those flags become available, the program mode decision has already been made and the server never starts.

This completely breaks the "distribute a self-contained llamafile" use case - you can't ship a llamafile that's pre-configured to run as a server via .args, which defeats one of the main benefits of the format.

The fix is straightforward: load .args before calling determine_program() so the embedded flags are visible when making the mode decision.
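
In code terms, the change amounts to swapping two calls at startup. A rough sketch, where only cosmo_args() and determine_program() are names from the actual code; the /zip/.args path and the surrounding variables are illustrative:

```cpp
int main(int argc, char **argv) {
    // Before the fix, the mode was chosen first, from an argv that did not
    // yet contain the flags embedded in .args:
    //
    //   int prog = determine_program(argv);      // misses --server --v2
    //   argv = cosmo_args("/zip/.args", argv);   // loaded too late
    //
    // After the fix, .args is merged into argv before the decision is made:
    argv = cosmo_args("/zip/.args", argv);        // embedded flags now visible
    int prog = determine_program(argv);           // correctly selects server v2
    // ... rest of startup
}
```
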

While I was in there, I also improved the other fixes - the URL normalization now avoids a memory leak by using static storage, the partial write handler does a proper retry loop instead of just one attempt, and the file transfer buffer got bumped to 16KB for better performance.
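
To make the URL normalization concrete, here is a minimal sketch of the kind of logic involved. The helper name is hypothetical and the details in flags.cpp may differ; the key idea is keeping the result in static storage so the returned pointer stays valid for the program's lifetime without leaking:

```cpp
#include <string>

// Hypothetical illustration of the normalization rules: force a leading
// slash, collapse runs of consecutive slashes, and drop any trailing slash.
static const char *normalize_url_prefix(const char *prefix) {
    static std::string normalized;       // static storage: stable lifetime, no leak
    normalized = "/";
    for (const char *p = prefix; *p; ++p) {
        if (*p == '/' && normalized.back() == '/')
            continue;                    // "//api//v1" -> "/api/v1"
        normalized += *p;
    }
    if (normalized.size() > 1 && normalized.back() == '/')
        normalized.pop_back();           // "/api/v1/" -> "/api/v1"
    return normalized.c_str();           // validation happens on this result
}
```

With this shape, `--url-prefix //api/v1`, `/api/v1/`, and `api/v1` all normalize to `/api/v1`, and validation runs on the normalized form.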
